GENETIC SYNTHESIS OF NEURAL NETWORKS

The invention hereof relates to a method for 
using genetic type learning techniques in connection 
with designing a variety of neural networks that are 
optimized for specific applications.

Previous work in the design of neural networks 
has revealed the difficulty in determining an 
appropriate network structure and good values for the 
parameters of the learning rules for specific 
applications.

The genetic algorithm is an optimization method 
based on statistical selection and recombination. The 
method is inspired by natural selection. A few 
researchers (Dolan & Dyer (1987), Dress & Knisely (1987),
Davis (1988), Montana and Davis (1989) and Whitley
(1988)) have applied genetic algorithms in a limited
fashion to generate neural networks for specific
problems. For example, Davis and Montana (1988, 1989)
and Whitley (1988) use the genetic algorithm to adjust 
weights given a fixed network structure. 

In the invention herein a general
representation of neural network architectures is linked
with the genetic learning strategy to create a flexible
environment for the design of custom neural networks. A
concept upon which the invention is based is the
representation of a network design as a "genetic
blueprint" wherein the recombination or mutation of
subsequently generated editions of such blueprints
results in different but related network architectures.

To illustrate the invention there is described 
herein a system for the genetic synthesis of a 
particular class of neural networks that we have 
implemented. Our current implementation is restricted 
to network structures without feedback connections and 
incorporates the back propagation learning rule. The 
invention can, however, be used for arbitrary network 
models and learning rules. 

The method herein involves the use of genetic 
algorithm methods to design new neural networks. The 
genetic algorithm (GA) is a robust function optimization 
method. Its use is indicated over gradient descent 
techniques for problems fraught with local minima, 
discontinuity, noise, or large numbers of dimensions. A 
useful feature of the GA is that it scales extremely
well: increasing dimensionality has comparatively little
effect on performance. The first step in the
application of the GA to a function is the encoding of
the parameter space as a string of (typically binary)
digits. Substrings in such a representation correspond
to parameters of the function being optimized; a
particular individual bit string (i.e. some choice of 1
or 0 for each position) represents a point in the
parameter space of the function. The GA considers a
population of such individuals. The population, in 
conjunction with the value of the function for each 
individual (generally referred to as "fitness") , 
represents the state of the search for the optimal 
string. The GA progresses by implicitly encoding 
information about the function in the statistics of the 
population and using that information to create new 
individuals. The population is cyclically renewed 
according to a reproductive plan. Each new "generation" 
of the population is created by first sampling the 
previous generation according to fitness; the method 
used for differential selection is known to be a 
near-optimal method of sampling the search space. Novel 
strings are created by altering selected individuals 
with genetic operators. Prominent among these is the 
crossover operator which synthesizes new strings by 
splicing together segments of two sampled individuals. 
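
By way of illustration only, the encoding and crossover steps described above may be sketched as follows. The field widths, the helper names `decode` and `crossover`, and the use of a single splice point are assumptions made for this sketch and are not taken from the specification.

```python
import random


def decode(bits, widths):
    """Split a bit string into fixed-width substrings and read each
    substring as an unsigned integer parameter (plain binary here)."""
    params, pos = [], 0
    for width in widths:
        params.append(int(bits[pos:pos + width], 2))
        pos += width
    return params


def crossover(parent_a, parent_b, point):
    """Splice the head of parent_a onto the tail of parent_b."""
    return parent_a[:point] + parent_b[point:]


rng = random.Random(0)            # seeded only to make the sketch repeatable
parent_a = "01010011"             # two 4-bit parameters: 5 and 3
parent_b = "11100001"             # two 4-bit parameters: 14 and 1
child = crossover(parent_a, parent_b, rng.randrange(1, len(parent_a)))
child_params = decode(child, [4, 4])
```

Each individual is thus a point in parameter space, and the child is a new point assembled from segments of two sampled parents.
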

A main object of the invention is to provide a
new method as referred to above for designing optimized
artificial neural networks.

Other objects and advantages of the invention
will become apparent from the following specification,
appended claims and attached drawings.

In the drawings: 

Fig. 1 illustrates a multilayer neural network 
of the type which may be designed for a specific purpose 
in accordance with the method of the present invention; 

Fig. 2 illustrates schematically how a 
population of "blueprints" (designs for different neural 
networks) is cyclically updated by the genetic algorithm 
based on their fitness; 

Fig. 3 shows schematically an example of a 
three-layer network which may be described by a bit 
string representation in accordance with the invention; 

Fig. 4 illustrates a bit string representation 
which facilitates practicing the invention; 

Fig. 5 illustrates the gross anatomy of a
network representation having areas or layers 0 to N;

Fig. 6 illustrates an arrangement of areas (or
layers) and projections extending therebetween;

Fig. 7 shows the spatial organization of
layers;

Figs. 8 and 9 show examples of absolute and
relative addressing for specifying the target
destinations of projections which extend from one
layer to another layer;

Figs. 10 to 12 show illustrative examples
of the area specification substring of Fig. 4;

Figs. 13a, 13b and 13c show projection 
features relating to connections between layers of 
the network; 

Figs. 14a, 14b and 14c show a schematic 
example of a specific network structure generated by 
the method herein at different levels of detail; 

Fig. 15 shows the basic reproductive plan 
used in experiments pursuant to the method herein;

Figs. 16a, 16b and 16c show an example of 
the operation of a genetic operator; 

Fig. 17 shows the principal data structures
in a current implementation program with one
individual being shown parsed and instantiated; and

Figures 18 to 21 show performance curves
relating to the rate of learning of networks.

The method herein relates to the designing
of multilayer artificial neural networks of the
general type 10 shown in Fig. 1. The network 10 is
illustrated as having three layers (or areas) 12, 14
and 16 but could have more than three layers or as
few as one layer if desired. Each of the layers has
computational units
18 joined by connections 19 which have variable weights
associated therewith in accordance with the teaching of
the prior art.

In this and other figures, connections are shown
in the forwardly feeding direction. The invention is
not limited to this construction, however, and feedback
connections may also be accommodated, for example.

Also, the scope of the network design method
disclosed herein is not limited to the design of the
network shown in Fig. 1.

Fig. 2 illustrates schematically how a
population of blueprints 20 (i.e. bit string designs for
different neural networks) is cyclically updated by a
genetic algorithm based on their fitness. The fitness
of a network is a combined measure of its worth on the
problem, which may take into account learning speed,
accuracy and cost factors such as the size and
complexity of the networks.

The method begins with a population of randomly
generated bit strings 20. The actual number of such bit
strings is somewhat arbitrary but a population size of
30 to 100 seems empirically to be a good compromise
between computational load, learning rate and genetic
drift.

NEURAL NETWORK LEARNING ALGORITHMS

Learning approaches for neural networks fall 
into three general categories: unsupervised learning, 
reinforcement learning, and supervised learning. In
unsupervised learning, the network receives no
evaluative feedback from the environment; instead it
develops internal models based on properties of received
inputs. In reinforcement learning, the environment
provides a weak evaluation signal. In supervised
learning the "desired output" for the network is 
provided along with every training input. Supervised 
learning, specifically back propagation, is used to 
illustrate the invention but in concept the invention 
can be used with any learning approach. 

The set of input-output examples that is used 
for supervised learning is referred to as the training 
set. The learning algorithm can be outlined as follows:

FOR EACH (training-input, desired-output) pair in the
Training-Set
  o Apply the training-input to the input of
    the network,
  o Calculate the output of the network,
  o IF the output of the network ≠
    desired-output
  o THEN modify network weights

The entire loop through the training set,
referred to as an epoch, is executed repeatedly. One or
both of two termination criteria are usually used: there
can be a lower bound on the error over an epoch and/or a
limit on the number of epochs. Training a network in
this fashion is often very time consuming. Until better
learning techniques become available, it is best to plan 
the training phase as an "off-line" activity. Once 
trained, the network can be put to use. The 
computational demands of such a network during the 
operational phase can usually be satisfied with only 
rudimentary hardware for many interesting applications. 
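
The epoch loop and its two termination criteria may be sketched as follows. To keep the example short, a single threshold unit with a perceptron-style weight update stands in for the learning rule; this is an assumed simplification and is not the back propagation rule used in the implementation described herein. The names `train`, `error_bound` and `max_epochs` are hypothetical.

```python
def train(samples, max_epochs=100, error_bound=0, eta=1.0):
    """Repeat epochs until the per-epoch error count reaches the bound
    or the epoch limit is hit. samples: list of (inputs, desired) pairs."""
    n_inputs = len(samples[0][0])
    weights, bias = [0.0] * n_inputs, 0.0
    epoch, errors = 0, 0
    for epoch in range(1, max_epochs + 1):
        errors = 0
        for inputs, desired in samples:
            # Apply the training input and calculate the output.
            activation = sum(w * x for w, x in zip(weights, inputs)) + bias
            output = 1 if activation > 0 else 0
            # Modify the weights only when the output is wrong.
            if output != desired:
                errors += 1
                delta = eta * (desired - output)
                weights = [w + delta * x for w, x in zip(weights, inputs)]
                bias += delta
        if errors <= error_bound:      # first termination criterion
            break                      # (the epoch limit is the second)
    return weights, bias, epoch, errors


# Usage: learn logical OR, which a single threshold unit can represent.
or_samples = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
weights, bias, epochs_used, final_errors = train(or_samples)
```
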

The neural network learning approach which we 
have currently implemented is the well-known 
backpropagation algorithm (Werbos, 1974; Le Cun, 1986;
Parker, 1985; Rumelhart, Hinton & Williams, 1985).

The backpropagation algorithm is described in 
Appendix B. 

BLUEPRINT REPRESENTATIONS 

The invention herein is mainly directed to a 
representation of the blueprint 20 that specifies both 
the structure and the learning rule, the genetic 
algorithm parameters that determine how the genetic 
operators are used to construct meaningful and useful 
network structures, and the evaluation function that
determines the fitness of a network for a specific 
application. 

The development of a bit string representation 
20 for the neural network architecture of a network 10
is a major problem with which the concept of the
invention is involved. Biological neural networks are
not yet understood well enough to provide clear
guidelines for synthetic networks and there are many
different ways to parameterize network organization and
operation. 

The representation of blueprints or bit strings 
20 for specialized neural networks should ideally be 
able to capture all potentially "interesting" networks, 
i.e., those capable of doing useful work, while 
excluding flawed or meaningless network structures. It
is obviously advantageous to define the smallest
possible search space of network architectures that is 
sure to include the best solution to a given problem. 
An important implication of this goal in the context of 
the genetic algorithm is that the representation scheme 
should be closed under the genetic operators. In other 
words, the recombination or mutation of network 
blueprints should always yield new, meaningful network 
blueprints. There is a difficult trade-off between
expressive power and the admission of flawed or 
uninteresting structures. 

Fig. 3 shows schematically an example of how
each layer of a three-layer network may be described in
accordance with the invention by a bit string
representation which comprises three substrings 17. The
format for a single substring 17 is shown in more detail
in Fig. 4.

The gross anatomy of a multilayer network 
representation 20 having substring layers or areas 17 
(Area 0 to Area N) is illustrated in Figure 5. 
Conceptually, all of the parameters for a single network 
are encoded in one long string of bits which is the 
representation 20 of Figure 5. The patterned bars are 
markers indicating the start and end of the individual 
area or layer segments 17. 

The term projection as used herein has 
reference to the grouping or organization of the 
connections 19 which extend between the computational 
units 18 of the layers of the networks such as in the 
network illustrations of Figs. 1 and 3. 

In Fig. 1 the input connections to layer 12 
represent a single input projection and the output 
connections extending outwardly from layer 16 represent 
a single output projection. Likewise the connections 19 
between layers 12 and 14 represent a single projection 
and the connections 19 between the layers 14 and 16 
represent a single projection. 

An example of a projection arrangement for a
particular network is shown in Fig. 6 with projections
22 to 28 being illustrated for layers or areas 31 to
35. Of interest is that layer 32 has two projections 24
and 25 extending respectively to layers 33 and 34. Also 
of interest is the opposite arrangement wherein layer 35 
receives projections 26 and 27 from layers 33 and 34 
respectively. 

Each of the projections is represented by three 
lines which signify that each projection consists of a 
predetermined or desired plurality of only the 
connections 19 which extend between two particular 
layers. 

Referring to Fig. 4, it will be apparent that
an area or layer specification substring 17 as
illustrated in this figure is applicable to each one of
the layers 12, 14 and 16 of the network 10 of Fig. 1.

A bit string 20 is thus composed of one or more
segments or substrings 17, each of which represents a
layer or area and its efferent connectivity or
projections. Each segment is an area specification
substring 17 which consists of two parts:

o An area parameter specification (APS)
which is of fixed length, and
parameterizes the area or layer in terms
of its address, the number of units 18 in
it, and how they are organized.
o One or more projection specification
fields (PSFs), each of fixed length. Each
such field describes a connection from one
layer to another layer. As the number of
layers is not fixed in this architecture
(although bounded), the length of this
field will increase with the number of
projection specifiers required. A
projection (e.g., one of the projections 22 to
28 in Fig. 6) is specified by the address
of the target area, the degree of
connectivity and the dimension of the
projection to the area, etc.

The fact that there may be any number of areas 
17 motivates the use of markers with the bit string to 
designate the start and end of APSs and the start of 
PSFs. The markers enable a reader program to parse any 
well-formed string into a meaningful neural network 
architecture. The same markers also allow a special 
genetic crossover operator to discover new networks 
without generating "nonsense strings". Markers are
considered "meta-structure": they serve as a framework
but don't actually occupy any bits.

Fig. 4 shows how the APS and PSF are structured
in our current representation. The portions of the bit
string representing individual parameters are labeled
boxes in the figure. They are substrings consisting of
some fixed number of bits. Parameters described by an
interval scale (e.g. 0, 1, 2, 3, 4) are rendered using
Gray coding, thus allowing values that are close on the
underlying scale to be close in the bit string
representation (Bethke, 1980; Caruana & Schaffer, 1988).
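
The property relied upon here, namely that consecutive values differ in exactly one bit, can be seen in a short sketch of the standard reflected binary Gray code. The function names are chosen for this example only.

```python
def gray_encode(value):
    """Convert a non-negative integer to its reflected binary Gray code."""
    return value ^ (value >> 1)


def gray_decode(code):
    """Recover the integer from its Gray code by folding in successively
    shifted copies of the code."""
    value = 0
    while code:
        value ^= code
        code >>= 1
    return value


# The eight codes available to a 3-bit field, in order of value.
codes = [format(gray_encode(v), "03b") for v in range(8)]
# codes == ['000', '001', '011', '010', '110', '111', '101', '100']
```

A single bit mutation in a Gray-coded field can therefore move the parameter to a neighboring value, which is not always possible with plain binary (e.g. 3 = 011 and 4 = 100 differ in three bits).
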

In the APS, each area or layer has an 
identification number that serves as a name. The name 
need not be unique among the areas of a bit string. The 
input and output areas have the fixed identifiers 0 and
7 in the embodiment herein.

An area also has a size and a spatial 
organization. The "total size" parameter determines how
many computational units 18 the area will have; it
ranges from 0 to 7, and is interpreted as the logarithm
(base 2) of the actual number of units; e.g., if total
size is 5, there are 32 units. The three "dimension
share" parameters, which are also base 2 logarithms,
impose a spatial organization on the units. The units
of areas may have 1, 2 or 3 dimensional rectilinear
extent, as illustrated in Fig. 7.

The motivation for this organization comes from
the sort of perceptual problems to which neural networks
are apparently well suited. For example, an image
processing problem may best be served by a square array,
while an acoustic interpretation problem might call
for vectors. The organization of the units in more
conventional approaches is often left implicit. In the
invention herein dimensionality has definite
implications for the architecture of projections such as
the projections 22 to 28 of Fig. 6.

The PSFs in an area's segment of the bit string 
determine where the outputs of units in that layer will 
(attempt to) make efferent connections, and how. The 
representation scheme does not assume a simple pipeline 
architecture, as is common. Fig. 6, for example, shows 
a five-area network in which projections split from the 
second area and rejoin in the fifth. 

Each PSF indicates the identity of the target
area. There are currently two ways it can do this,
distinguished by the value of a binary addressing mode
parameter in each PSF. In the "Absolute" mode, the
PSF's address parameter is taken to be the ID number of
the target area. Some examples of absolute addressing
are shown in Fig. 8.

The "Relative" mode indicates that the address
bits hold the position of the target area in the bit
string relative to the current area. A relative address
of zero refers to the area immediately following the one
containing the projection; a relative address of n
refers to the nth area beyond this, if it exists.
Relative addresses indicating areas beyond the end of
the blueprint are taken to refer to the final area of
the blueprint, the output area. Some examples of
relative addressing are shown in Fig. 9.
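
The two addressing modes may be sketched as a single lookup. The function name `resolve_target` and the list-of-IDs representation are hypothetical. Because area IDs need not be unique, the sketch assumes that an absolute address refers to the first area in the string bearing that ID, and that an address matching no area yields no target; neither rule is stated in the specification.

```python
def resolve_target(area_ids, current, mode, address):
    """Return the position in the blueprint of the area targeted by a
    projection, or None if an absolute address matches no area.

    area_ids: ID numbers of the areas, in bit-string order.
    current:  position of the area that owns the projection.
    """
    last = len(area_ids) - 1
    if mode == "relative":
        # Address 0 is the next area; addresses running past the end
        # of the blueprint refer to the final (output) area.
        return min(current + 1 + address, last)
    # Absolute mode: the address is the ID number of the target area.
    for position, area_id in enumerate(area_ids):
        if area_id == address:
            return position        # assumed tie-break: first match wins
    return None


ids = [0, 3, 5, 7]                 # input area is 0, output area is 7
next_area = resolve_target(ids, 0, "relative", 0)      # position 1
clamped = resolve_target(ids, 1, "relative", 6)        # position 3 (output)
by_id = resolve_target(ids, 0, "absolute", 5)          # position 2
```
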

The purpose of different addressing schemes is 
to allow relationships between areas to develop, and be 
sustained and generalized across generations through the 
genetic algorithm's reproductive plan. Specifically, 
the addressing schemes are designed to help allow these 
relationships to survive the crossover operator, either 
intact or with potentially useful modifications. 
Absolute addressing allows a projection to indicate a 
target no matter where that target winds up in the 
chromosome of a new individual. Relative addressing 
helps areas that are close in the bit string to maintain 
projections, even if their IDs change. 

Referring to Figs. 10 to 12, the dimension
radii parameters (also base 2 logarithms) allow units in
an area to project only to a localized group of units in
the target area. This feature allows the target units
to have localized receptive fields 29, which are both
common in biological neural networks and highly
desirable from a hardware implementation perspective.
Even within receptive fields 29, projections between one
area or layer and another do not necessarily imply full
factorial connectivity. The connection density
parameter for the projection may stipulate one of eight
degrees of connectivity between 30% and 100%.

At this point it may be well to mention that, 
because of the magnitude of the numbers involved for the 
units 18 and the connections 19, it is contemplated that 
in a typical system the numbers will be represented by 
their logarithms. In Figs. 10 to 12 and 15 herein which
show examples of the substring 17, decoded numbers are 
used by way of illustration to facilitate an 
understanding of the concepts. 

Projections include a set of weighted 
connections. The weights are adjusted by a learning 
rule during the training of the network. Parameters are 
included in the PSF to control the learning rule for 
adjusting the weights of the projection. The eta
parameter controls the learning rate in back propagation
and may take on one of 8 values between 0.1 and 12.8.
Eta need not remain constant throughout training. A
separate eta-slope parameter controls the rate of
exponential decay for eta as a function of the training
epoch.
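
A possible reading of the eta and eta-slope parameters is sketched below. The specification gives only the endpoints (0.1 and 12.8) and the number of values (8); the sketch assumes the values double from one table entry to the next, which happens to fit those endpoints, and assumes a decay of the form eta0 * exp(-slope * epoch) with the slope supplied directly as a real number. Both are assumptions, as are the names `ETA_TABLE` and `eta_at`.

```python
import math

# Assumed spacing: 0.1, 0.2, 0.4, ..., 12.8 (eight entries).
ETA_TABLE = [0.1 * 2 ** k for k in range(8)]


def eta_at(initial_code, slope, epoch):
    """Learning rate at a given training epoch under the assumed
    exponential decay; slope 0.0 keeps eta constant."""
    return ETA_TABLE[initial_code] * math.exp(-slope * epoch)


start = eta_at(3, 0.1, 0)      # equals ETA_TABLE[3]
later = eta_at(3, 0.1, 20)     # smaller than start
```
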

An example of how this representation scheme 
can be used to specify a 3-layer network is shown in 
Fig. 3. 

The first and last areas or layers of the 
network have a special status. The first, the input 
area, represents the set of terminals that will be
"clamped" by the network's environment, effectively the
input stimulus. The final area is always the output 
area, and has no projections. 

A blueprint representation in BNF of the neural
network described herein is given in Appendix A at the end
hereof. It is anticipated that there will be future
modifications and additions to it.

Figs. 10 to 12 show three examples of 
substrings 17 which illustrate the projection specifier 
sections thereof relative to the radius and the 
connection density parameters. These figures show 
examples of projections 21 from a layer or Area 1 to a 
layer or Area 2. The projection in Fig. 10 is from a
one dimensional area (Area 1) to a two dimensional area
(Area 2) and the projections in Figs. 11 and 12 are each
from a one dimensional area (Area 1) to another
one dimensional area (Area 2).

In Fig. 10 the illustrated projection is to an
8 by 4 projection array 29 of computational units 18
and, by convention, this array is deemed to have a
radius of 4 in the vertical direction and a radius of 2
in the horizontal direction. The object array 29 is
symmetrically arranged relative to the source unit 18a
in Area 1. As each unit within the boundary of the
projection array 29 is connected, the connection density
parameter is 100.

It will be understood that each of the 
computational units 18 in Area 1 will in a similar 
manner have connections to respective 8x4 projection 
arrays of units in Area 2 which results in substantial 
overlapping of projection arrays and a very dense 
connection system. 

In Fig. 11 the projections are to every other 
one of a linear array of 20 units. The radius is 8 as
indicated but the connection density parameter is only
50 because only half of the units within the radius are 
connected. 

Fig. 12 is similar to Fig. 11 except that every 
computational unit in the array is connected and thus 
the connection density is 100. 

Figs. 11 and 12 are similar to Fig. 10 relative 
to the matter of having each unit in Area 1 connected to 
a respective projection array of units in Area 2. 

Potential target units of a projection from
a given source unit are determined by radii along
three dimensions. Figures 13a, 13b and 13c are three
two-dimensional examples of this.

Figures 14a, 14b and 14c, taken together, 
provide a schematic example of a specific network
structure generated by the method herein. 

ADAPTING GENETIC ALGORITHMS

The version of the genetic algorithm used 
in the method herein employs a reproductive plan 
similar to that described by Holland (1975) as "type 
R". The basic plan for generating each new 
generation is given in Fig. 15. The sampling
algorithm is based on the stochastic universal
sampling scheme of Baker (1987). This is preferred
for its efficiency and lack of bias. Some of the
details are not shown by the diagram. A final step
was added to ensure that the best individual from
generation i was always retained in generation i+1.
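
The sampling step and the added retention step may be sketched as follows. The function `sus` follows the usual description of stochastic universal sampling (equally spaced pointers sharing one random offset); `select_with_elitism` is a hypothetical wrapper showing one way the best individual could be carried forward, and is not the reproductive plan of Fig. 15. Fitness values are assumed non-negative with a positive sum.

```python
import random


def sus(fitness, n, rng):
    """Pick n indices with probability proportional to fitness, using
    n equally spaced pointers and a single random starting offset."""
    step = sum(fitness) / n
    start = rng.uniform(0, step)
    chosen, index, cumulative = [], 0, fitness[0]
    for k in range(n):
        pointer = start + k * step
        # Advance to the individual whose fitness span holds the pointer.
        while cumulative < pointer and index < len(fitness) - 1:
            index += 1
            cumulative += fitness[index]
        chosen.append(index)
    return chosen


def select_with_elitism(population, fitness, rng):
    """Keep the best individual, then fill the remaining places by SUS."""
    best = max(range(len(population)), key=fitness.__getitem__)
    picks = sus(fitness, len(population) - 1, rng)
    return [population[best]] + [population[i] for i in picks]


rng = random.Random(1)
picks = sus([1.0, 1.0, 2.0], 4, rng)
survivors = select_with_elitism(["a", "b", "c"], [1.0, 1.0, 2.0], rng)
```

With fitness values 1, 1 and 2 and four samples, each individual receives exactly its expected number of copies (1, 1 and 2), which is the low-variance behavior for which the scheme is preferred.
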

The genetic algorithm (GA) itself has a 
number of parameters. Good values for these are 
important to the efficient operation of the system. 
These parameters include the population size, the 
rates at which to apply the various genetic
operators, and other aspects of the synthetic
ecology.

Two genetic operators have been used:
crossover and mutation. The crossover operator
effectively exchanges homologous segments from the
blueprints of two networks from the current generation
to create a blueprint for a network in the next
generation. In most applications of the genetic
algorithm, homologous segments are identifiable by
absolute positions in the bit string. For example, the
Nth bit will always be used to specify the same trait in
any individual. Because the representation herein 
allows variable length strings, a modified two-point 
crossover operator was employed that determined 
homologous loci on two individuals by referring to the 
string's markers, discussed above. The decision to use 
a two-point crossover as opposed to the more common 
single-point version was motivated by Booker's (1987) 
report that improved off-line performance could be 
achieved this way. 
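
A much simplified sketch of a two-point crossover on variable-length blueprints follows. It assumes that a blueprint is held as a list of area segments and that both cut points fall on area markers, so that whole areas are exchanged; the operator described herein can also cut within fields (see Figs. 16a to 16c), which this sketch does not attempt. The name `marker_crossover` is hypothetical.

```python
import random


def marker_crossover(parent_a, parent_b, rng):
    """Replace a run of interior areas of parent_a with a run of
    interior areas of parent_b. Each parent is a list of area segments
    whose first element is the input area and last is the output area."""
    # Cut points lie strictly inside each parent, so the first and
    # last areas of parent_a always survive in the child.
    i1, i2 = sorted(rng.randint(1, len(parent_a) - 1) for _ in range(2))
    j1, j2 = sorted(rng.randint(1, len(parent_b) - 1) for _ in range(2))
    return parent_a[:i1] + parent_b[j1:j2] + parent_a[i2:]


rng = random.Random(2)
parent_a = ["in", "a1", "a2", "out"]
parent_b = ["in", "b1", "out"]
child = marker_crossover(parent_a, parent_b, rng)
```

Because the pieces are cut at markers, the child may be longer or shorter than either parent yet still parses as a sequence of complete areas.
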

The mutation operator was used at a low rate to
introduce or reintroduce alleles (alternate forms of the
same functional gene). Current applications of the
genetic algorithm have demonstrated an effective
contribution from mutation at rates on the order of 
10"^ or less. 

Despite the fact that the bit string
representation was designed with closure under the
genetic operators in mind, it is still possible for
the GA to generate individuals that are prima facie
unacceptable. A blatant example would be a network
plan that had no pathway of projections from input
to output. Subtler problems arise from the
limitations of our simulation capability. In our
initial work we have limited recurrence; network
plans with feedback cannot be tolerated under simple
back propagation. Two strategies have been employed
for minimizing the burden of these misfits. First,
the reproductive plan culls individuals with fatal
abnormalities; individuals with no path from input
to output area compose the bulk of this group.
Second, blueprints
with minor abnormalities are "purified" in their
network implementation, i.e. their defects are
excised.
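
The test for the most common fatal abnormality, the absence of any pathway of projections from the input area to the output area, may be sketched as a breadth-first search. The representation of projections as (source, target) position pairs and the name `has_input_output_path` are assumptions of this sketch.

```python
from collections import deque


def has_input_output_path(num_areas, projections):
    """Return True if the output area (last position) can be reached
    from the input area (position 0) by following projections."""
    adjacency = {}
    for source, target in projections:
        adjacency.setdefault(source, []).append(target)
    goal = num_areas - 1
    seen, queue = {0}, deque([0])
    while queue:
        area = queue.popleft()
        if area == goal:
            return True
        for nxt in adjacency.get(area, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


# A split-and-rejoin arrangement of five areas, as in Fig. 6.
viable = has_input_output_path(5, [(0, 1), (1, 2), (1, 3), (2, 4), (3, 4)])
# A plan whose only projection stops short of the output area.
culled = not has_input_output_path(3, [(0, 1)])
```
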

Figures 16a, 16b and 16c show an example of 
how the crossover operator can create new strings 
with different values for fields than either of the 
parents. Here it is assumed that the fields use a 
simple binary encoding scheme. 

EVALUATION OF SYNTHESIZED NETWORKS

Suitable improvements over generations can only 
be accomplished if the evaluation function used to 
measure the fitness of a network is appropriate. A
measure of fitness is necessary for the GA to produce
better and better networks. It is helpful to envision
the algorithm as exploring the surface over the 
blueprint representation space defined by this function 
in an attempt to locate the highest peaks. 

In accordance with the requirements of the 
evaluation function stated above, we have initially 
formulated the evaluation function as a weighted sum of
the performance metrics, p_j. The evaluation function,
F(i), for individual i can be expressed as:

$$F(i) = \sum_{j} a_j \, p_j(i)$$

The coefficients a_j may be adjusted by the
user to reflect the desired character of the network. 
Metrics that have been considered thus far include 
performance factors such as observed learning speed and 
the performance of the network on noisy inputs, and cost 
factors such as the size of the network, and the number 
of connections formed. We have adopted a melange of 
different performance and cost factors since performance
criteria vary from application to application. Because
the relative weight on each factor can be modified, the
network structure can be tuned for different
optimization criteria. For example, if one of our goals
is to synthesize networks that are computationally
efficient, the size metrics might be given negative
weights. On the other hand, if accuracy and noise
tolerance are more crucial, then the performance on noisy
input patterns would be given a higher weight.
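
The weighted sum can be written out directly. The metric names and coefficient values below are invented for illustration; only the form of the sum comes from the text.

```python
def evaluate(metrics, coefficients):
    """Weighted sum of performance metrics for one individual."""
    return sum(coefficients[name] * value for name, value in metrics.items())


# Hypothetical metrics for one network and two hypothetical weightings.
metrics = {"learning_speed": 0.8, "noise_accuracy": 0.9, "num_units": 40}
favor_accuracy = {"learning_speed": 1.0, "noise_accuracy": 2.0, "num_units": 0.0}
favor_small = {"learning_speed": 1.0, "noise_accuracy": 2.0, "num_units": -0.01}

score_accuracy = evaluate(metrics, favor_accuracy)   # 0.8 + 1.8 + 0.0
score_small = evaluate(metrics, favor_small)         # 0.8 + 1.8 - 0.4
```

Giving the size metric a negative coefficient lowers the score of larger networks without changing how the other metrics are rewarded.
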

EVALUATION OF GA PERFORMANCE 

In order to draw conclusions about the
performance of the genetic algorithm (as opposed to the
networks themselves) in discovering useful
architectures, we require some standard to compare it
against. This is difficult since there seems to be no
published data directly relevant to the problem. Our
approach is to run a control study in which network
blueprints are generated at random, evaluated, and the
best retained. This is effected by simply "turning off"
the genetic operators of crossover and mutation. Random 
search is an oft employed benchmark of performance that 
other search algorithms must exceed to demonstrate their 
value. 

DATA STRUCTURES

The major data structures in a current 
implementation of the invention are objects that are 
created and linked together at run time. The most 
prominent object is the "experiment" which maintains the 
current population, the history of performance over 
generations, as well as various control and interface 
parameters. The performance history is a list of 
records, one per generation, noting, among other things, 
on-line, off-line, average and best scores. The
population comprises the individuals of the current
generation as shown in Fig. 17.

Each individual has an associated blueprint, 
which is stored as a bundle of bit vectors [bit vectors 
are one-dimensional arrays in which each element 
occupies one bit in the machine's memory]. 

The bit vectors are of two types, areas (APS)
and projections (PSF), as indicated by the BNF. The
structure of each type is defined by a Lisp form,
indicating the names of each field, and how many bits it
should occupy. For example, the projection
specification is defined as:

(def-bit-vector PROJECTION-SPEC
  (radius-1 3)
  (radius-2 3)
  (radius-3 3)
  (connection-density 3)
  (target-address 3)
  (address-mode 1)
  (initial-eta 3)
  (eta-slope 3))



WO 90/11568 PCT/US90/00828

This form automatically defines the accessors 
needed to extract the value for each parameter from any 
given bit vector. The accessors transparently effect 
the Gray coding and decoding of fields. Most of the
integral values of fields are interpreted through lookup
tables; for example, an eta table translates the values
0..7 to etas from 0.1 to 12.8.
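
What such accessors do may be sketched in Python using the field names and widths of the PROJECTION-SPEC form above. The reader function `read_fields`, the choice of which fields are Gray coded, and the sample bit pattern are assumptions of this sketch; the Lisp macro itself is not reproduced.

```python
# Field names and bit widths copied from the PROJECTION-SPEC form.
PROJECTION_SPEC = [
    ("radius-1", 3), ("radius-2", 3), ("radius-3", 3),
    ("connection-density", 3), ("target-address", 3),
    ("address-mode", 1), ("initial-eta", 3), ("eta-slope", 3),
]


def gray_decode(code):
    """Recover an integer from its reflected binary Gray code."""
    value = 0
    while code:
        value ^= code
        code >>= 1
    return value


def read_fields(bits, layout, gray_fields=()):
    """Extract every field of a bit vector, Gray-decoding the fields
    named in gray_fields and reading the rest as plain binary."""
    values, pos = {}, 0
    for name, width in layout:
        raw = int(bits[pos:pos + width], 2)
        values[name] = gray_decode(raw) if name in gray_fields else raw
        pos += width
    return values


# A 22-bit projection specifier, written field by field for clarity.
bits = "011" "000" "000" "111" "101" "1" "010" "000"
fields = read_fields(bits, PROJECTION_SPEC,
                     gray_fields={"radius-1", "initial-eta"})
```

The decoded integers would then be passed through lookup tables, such as the eta table mentioned above, to obtain the working parameter values.
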

Genetic operators such as crossover and 
mutation directly modify this bit vector blueprint, 
which is considered the master plan for the individual. 
Pieces of it are actually shared with its offspring. 
The bit vectors are not directly useful in running an 
actual neural network, however. For this, the 
individual must be parsed, purified, and instantiated. 

When an individual is parsed, the bit string 
form of the blueprint is translated into a network of 
nodes: an area node for each area, and a projection node
for each projection. Parsing works out the inter-area
addressing done by projections, and the nodes carry
parameter values interpreted from the associated bit
vectors. The network, or parsed blueprint, is
associated with the object representing the individual.

A parsed blueprint may have defects that
prevent a meaningful interpretation as a neural
network. For example, it might contain projections with
no valid target, or projections indicating feedback
circuits, which are prohibited in the current
implementation. Rather than discarding slightly
imperfect individuals, an attempt is made to patch them
after parsing. The patching step is called
purification. The purifier removes dangling nodes and
cuts circuits in an attempt to create a viable
individual while making as few changes as possible.
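A minimal Python sketch of one way such a repair could work is given here. The purifier's actual algorithm is not described in the text, so everything in the sketch is an assumption: areas are plain identifiers, projections are (source, target) pairs, and circuits are cut greedily by skipping any projection that would close a loop. A greedy pass does not guarantee the smallest possible number of changes.

```python
def purify(areas, projections):
    """Return the projections that survive a simple two-step repair.

    areas: iterable of area identifiers.
    projections: list of (source, target) pairs, in blueprint order.
    """
    known = set(areas)

    # Step 1: drop projections whose source or target is not a known area.
    valid = [(s, t) for s, t in projections if s in known and t in known]

    # Step 2: accept projections one at a time, skipping any projection
    # that would close a feedback circuit among those already accepted.
    successors = {a: [] for a in known}

    def reaches(start, goal):
        """Depth-first search over the accepted projections."""
        stack, seen = [start], set()
        while stack:
            node = stack.pop()
            if node == goal:
                return True
            if node in seen:
                continue
            seen.add(node)
            stack.extend(successors[node])
        return False

    kept = []
    for s, t in valid:
        if reaches(t, s):      # t already leads back to s: s->t is feedback
            continue
        successors[s].append(t)
        kept.append((s, t))
    return kept
```

For example, `purify([0, 1, 2], [(0, 1), (1, 2), (2, 1), (1, 7)])` keeps only the two feed-forward projections: the projection to the unknown area 7 is dropped, and (2, 1) is cut because it would feed back into area 1.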

Following parsing and purification, an
individual is instantiated. Instantiation involves
allocating and initializing vectors for units, weight
matrices, mask matrices, threshold vectors, and other
numerical storage. References to these data objects are
kept in the nodes of the individual's parsed blueprint.

THE EVALUATION PROCESS

The purpose of the parse/purify/instantiate
sequence is to set the stage for the evaluation of the
individual, i.e. the computation of a score. The score
is a weighted sum of a set of performance metrics. The
weights may be set by the user at run time.

Some of these metrics are immediate
consequences of instantiation, e.g. number of weights,
number of units, number of areas, and average fan-out.
Other metrics depend on the individual network's
performance on a given problem (such as digit
recognition). Examples of such metrics are: the
learning rate of the network, its final performance on
the training set, its performance on non-degraded inputs
and on novel inputs, and its performance after
temporarily nullifying a random sample of either the
weights or units of the network.
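The weighted-sum score can be written in a few lines of Python. The metric names in the usage note are illustrative only, and the sketch assumes each metric has already been normalized so that larger values are better.

```python
def score(metrics, weights):
    """Weighted sum of the metrics that have a user-supplied weight.

    metrics: dict mapping metric name to its (normalized) value.
    weights: dict mapping metric name to the weight chosen at run time.
    Metrics without a weight do not contribute to the score.
    """
    return sum(weight * metrics[name] for name, weight in weights.items())
```

A call such as `score({"percent-correct": 1.0, "fan-out": 0.5}, {"percent-correct": 0.5, "fan-out": 0.5})` weights the two criteria equally and returns 0.75.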

RESULTS, ANALYSIS AND DISCUSSION 

Despite the restricted scope of initial
experiments, the method herein has produced reasonable
networks, and has achieved significant improvements over
the chance structures in its initial generation. In
most cases, the networks produced have been structurally
fairly simple.

PERFORMANCE CRITERIA 

There are several common ways to look at the
changes in population performance over time in genetic
optimization systems, and most of our charts include
four. Because our reproductive plan goes through
separate phases of reproduction and evaluation, the data
points are actually recorded at the end of each
generation.

Define $S_i$ to be the score of the ith
individual generated. The best function indicates the
performance of the best individual discovered by the GA
up to a given time, i.e.

$$\mathrm{Best}(i) = \max\{S_j : j = 1, \ldots, i\}$$

The off-line GA performance is the mean of the
best individuals' scores found up to a given time:

$$\text{Off-line}(i) = \frac{1}{i} \sum_{j=1}^{i} \mathrm{Best}(j)$$
An alternative is the on-line performance.
This is simply the mean of all individuals' scores
evaluated so far. At the end of time i, this would be:

$$\text{On-line}(i) = \frac{1}{i} \sum_{j=1}^{i} S_j$$

Another interesting function is the average
score for all of the individuals in a given generation.
If $G_i$ is the set of individuals in the ith generation,
then:

$$\mathrm{Average}(i) = \frac{1}{|G_i|} \sum_{j \in G_i} S_j$$






On-line performance is perhaps most relevant to
systems that must interact with a real-time process,
whereas off-line performance is more relevant to systems
that are concerned only with finding the best and not
how much it costs to look. For example, if one were
picking horses, it would be important to take into
consideration all of the poor bets as well as the
winners, motivating interest in on-line performance. If
one were optimizing a function, the only concern might
be about the quality of the best point tested,
motivating off-line performance. Note that the "Best"
and "Off-line" functions are isotone by definition: they
can only increase or remain constant over the course of
an experiment, and cannot decrease.
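The four performance functions can be computed from the list of scores in evaluation order, as in the following Python sketch. The function names are hypothetical, and the per-generation average assumes a fixed population size with individuals stored generation by generation.

```python
def performance_curves(scores):
    """Return (best, off_line, on_line) lists for scores S_1..S_n.

    best[i-1]     = max of S_1..S_i
    off_line[i-1] = mean of Best(1)..Best(i)
    on_line[i-1]  = mean of S_1..S_i
    """
    best, off_line, on_line = [], [], []
    best_sum = 0.0
    score_sum = 0.0
    for i, s in enumerate(scores, start=1):
        current_best = s if not best else max(best[-1], s)
        best.append(current_best)
        best_sum += current_best
        score_sum += s
        off_line.append(best_sum / i)
        on_line.append(score_sum / i)
    return best, off_line, on_line


def generation_average(scores, population_size, generation):
    """Mean score of one generation (generation 0 is the initial one)."""
    start = generation * population_size
    members = scores[start:start + population_size]
    return sum(members) / len(members)
```

Because Best is a running maximum and Off-line is a running mean of a non-decreasing sequence, both returned lists are non-decreasing, while the on-line curve can move in either direction.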

EXPERIMENT 1

Application: Digit Recognition
Optimization Criterion: Area under learning curve
Population Size: 30
Generations: 60

The average performance of the network population
increased eight-fold from the first to the sixtieth
generation. The network learned to criterion in 48
epochs.

Since only one factor was being directly
optimized, others such as the number of weights were
free to vary. The network had 1481 weights. A network
which had exactly one weight connecting each input with
each output would have only a third as many weights.
Such networks were also produced, and these learned
perfectly as well, but took more than twice as long.
The performance over this 60-generation experiment is
summarized in Fig. 18.

In the initial generations, hidden-layer
structures were present. It was not obvious to us that
this problem is linearly separable until the experiment
started producing two-layer structures that were
learning perfectly. Since hidden layers are not needed
for this problem, and since learning rates in general
degrade as hidden layers are added to a network
(although this degradation is much less severe with the
modified back-propagation rule we are using [Samad,
1988] than with the original rule), towards the end of
the simulation multiple-layer structures were rare.

In order to evaluate the performance of the GA
in discovering better networks, the digit recognition
problem was repeated with the GA disabled. To achieve
this, random individuals were generated where crossover
or mutation would have been applied. Again, scores were
based exclusively on the area under the learning curve.
The results of this experiment are charted in Fig. 19.
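The control condition can be pictured as the same reproduction step with the genetic operators swapped out for random generation. The Python sketch below is illustrative only: the text does not specify the selection scheme, crossover type, or mutation rate, so binary tournament selection, one-point crossover, and a per-bit mutation rate are all assumptions, as is the convention that a higher score is better.

```python
import random


def random_individual(length, rng):
    """A fresh random bit string of the given length."""
    return [rng.randint(0, 1) for _ in range(length)]


def next_generation(population, scores, rng, use_ga=True, mutation_rate=0.01):
    """Produce the next population of bit-string blueprints.

    With use_ga=False every offspring is a random individual, which
    corresponds to running the experiment with the GA disabled.
    Blueprints must be at least two bits long for one-point crossover.
    """
    size, length = len(population), len(population[0])
    if not use_ga:
        return [random_individual(length, rng) for _ in range(size)]

    def select():
        # Binary tournament: the better of two randomly drawn individuals.
        a, b = rng.randrange(size), rng.randrange(size)
        return population[a] if scores[a] >= scores[b] else population[b]

    offspring = []
    while len(offspring) < size:
        mother, father = select(), select()
        cut = rng.randrange(1, length)              # one-point crossover
        child = mother[:cut] + father[cut:]
        child = [bit ^ 1 if rng.random() < mutation_rate else bit
                 for bit in child]                  # per-bit mutation
        offspring.append(child)
    return offspring
```

Keeping everything else identical (population size, scoring, number of generations) means any difference between the two runs can be attributed to selection and recombination rather than to the amount of search performed.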

While the random search and GA experiments
started with very similar populations in generation 0,
the performance of the two algorithms soon diverged. In
particular, average and on-line performances of the
random search algorithm were conspicuously inferior to
the GA. This is to be expected if the GA is successful
in retaining some of the better characteristics from one
generation to the next; the random search procedure is
confined to picking "losers" at a fairly constant rate.
The off-line performance is arguably a more interesting
comparison for this problem between the GA and random
search. Figure 20 shows off-line performances extracted
from Figs. 18 and 19.

Once again, the GA performance dominates random
search for the duration of the experiment. It could be
argued that the gap is not a large one but, as stated,
the scores are normalized. The best network discovered
(by chance) after 60 generations took 67 epochs to learn
the problem while the best network discovered by the GA
learned the problem in 40 epochs. Further, it seems
likely that we will be able to improve the performance
of the GA through altered representation and better
parameter values, while there is no latitude for
improvement in the performance of the random search
procedure. Finally, a caveat: we are running with a
relatively small population, and our experiments have
been limited to few generations. All of these results
should therefore be interpreted with caution.

EXPERIMENT 2

Application: Digit Recognition
Optimization Criteria: Average fan-out and percent correct
Population Size: 30
Generations: 20

In this experiment, the criteria were the
average fan-out and percentage correct, equally weighted
(0.5). Learning rate was not given any direct influence
on the score. The percentage of correct digit
identifications after training was determined by
presenting each of the ten digits to the trained network
and scoring a "hit" if the output unit with maximal
value corresponded to the correct digit. The average
fan-out is defined as the ratio of the number of weights
to the number of units; this metric is normalized and
inverted, so that a large ratio of weights to units will
detract from an individual's score. The question posed
by this experiment is: can the system improve
performance by limiting fan-out? It is a potentially
interesting question to designers of neural network
hardware, since high fan-outs are difficult to engineer
in silicon. [Average fan-out is an approximation of an
even more interesting quantity: maximal fan-out.] Our
initial results are shown in Fig. 21.

The average fan-out in this experiment was
157/48 = 3.27. This can be contrasted with the network
shown for Experiment 1, which has an average fan-out
that is almost an order of magnitude higher.
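The two criteria of this experiment can be sketched in Python as follows. The hit-rate and fan-out definitions follow the text; the normalization in `fan_out_score` is an assumption (the text says only that the metric is "normalized and inverted"), and `worst_fan_out` is a hypothetical scaling constant.

```python
def hit_rate(outputs, targets):
    """Fraction of patterns whose maximal output unit is the target digit.

    outputs: one list of output-unit activations per presented pattern.
    targets: the index of the correct output unit for each pattern.
    """
    hits = 0
    for activations, digit in zip(outputs, targets):
        predicted = max(range(len(activations)), key=activations.__getitem__)
        if predicted == digit:
            hits += 1
    return hits / len(targets)


def average_fan_out(n_weights, n_units):
    """Ratio of the number of weights to the number of units."""
    return n_weights / n_units


def fan_out_score(fan_out, worst_fan_out):
    """Assumed normalization: 1.0 at zero fan-out, 0.0 at worst_fan_out."""
    return max(0.0, 1.0 - fan_out / worst_fan_out)
```

With the figures quoted above, `average_fan_out(157, 48)` gives approximately 3.27. An equally weighted score would then be `0.5 * hit_rate(...) + 0.5 * fan_out_score(...)`.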

Learning was quite slow. In fact, the above
network did not learn to within the error threshold that
was prespecified as a termination criterion for
training. (Learning to within the error threshold is
not necessary to achieve perfect hit rates.) The
connectivity structure of the network uses large
receptive fields but low connection density. From a
hardware implementation perspective, it would be better
to optimize for small receptive fields, and such an
experiment is contemplated.

METRIC FOR LEARNING RATE 

The metric chosen for learning rate requires
some explanation. Because of limited computational
resources, we cannot hope to train all networks until
they achieve perfect accuracy on a given problem, or for
that matter to any non-zero predetermined criterion. In
some cases, a network may require a hundred epochs while
in others a million may be insufficient. Our compromise
is to employ two criteria for halting the learning
phase. Learning is halted under the first criterion
when rms error during the previous epoch was lower than
a given threshold. The learning phase is terminated
under the second criterion after a fixed number of
epochs has been counted; this threshold is set by the
experimenter according to the problem, but it is
typically between 100 and 5000 epochs. We nonetheless
wish to compare all individuals on the same learning
rate scale even though their training may have lasted
different numbers of epochs and resulted in different
final levels of accuracy. Our approximation is to
integrate the rms error curve over the learning phase
for each individual. This "area under the learning
curve" provides a rank that corresponds closely to our
intuition about learning rate scales. Lower numbers
imply better performance.
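The two halting criteria and the integration of the error curve can be sketched in Python. The sketch assumes a hypothetical `train_epoch` callable that runs one epoch and returns that epoch's rms error, and it approximates the integral by summing the per-epoch errors (a unit-width rectangle rule); the integration method actually used is not stated in the text.

```python
def area_under_learning_curve(train_epoch, rms_threshold, max_epochs):
    """Train until either halting criterion is met.

    Criterion 1: the rms error of the epoch just finished is below
                 rms_threshold.
    Criterion 2: max_epochs epochs have been run.
    Returns (area, epochs_run); a lower area means faster learning.
    """
    area = 0.0
    epochs_run = 0
    while epochs_run < max_epochs:
        rms = train_epoch()
        epochs_run += 1
        area += rms                 # one unit-width rectangle per epoch
        if rms < rms_threshold:
            break
    return area, epochs_run
```

A network that reaches the threshold quickly accumulates little area, while one that runs to the epoch limit with a high error accumulates a large area, so the single number ranks both kinds of individual on one scale.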
APPENDIX A

Syntax for Blueprint Representation in BNF

<blueprint-spec>        ::= <input-spec> <output-spec>
<input-spec>            ::= <area-spec> <projection-spec>
[...]                   ::= [...] <learning-rule-spec>
<projection-spec>       ::= <projection-marker> <projection-spec-field> |
                            <projection-spec> <projection-marker> <projection-spec-field>
[...]                   ::= <area-spec> <projection-spec>
<area-marker>           ::= empty
<area-id>               ::= <binary-digits>
<dimension-subfield>    ::= <total-size> <dim-spec> <dim-spec> <dim-spec>
<learning-rule-spec>    ::= <eta-initial-value> <slope-of-changing-eta>
<projection-marker>     ::= empty
<projection-spec-field> ::= <radii-of-connectivity> <connection-density>
                            <target-address> <target-address-mode>
                            <learning-rule-spec>
<binary-digits>         ::= <binary-digit> | <binary-digits> <binary-digit>
<total-size>            ::= <binary-digit> <binary-digit> <binary-digit>
<dim-spec>              ::= <binary-digit> <binary-digit> <binary-digit>
<eta-initial-value>     ::= <binary-digit> <binary-digit> <binary-digit>
<slope-of-changing-eta> ::= <binary-digit> <binary-digit> <binary-digit>
<radii-of-connectivity> ::= <radius-of-connection> <radius-of-connection>
                            <radius-of-connection>
<radius-of-connection>  ::= <binary-digit> <binary-digit> <binary-digit>
<connection-density>    ::= <binary-digit> <binary-digit> <binary-digit>
<target-address>        ::= <binary-digit> <binary-digit> <binary-digit>
<target-address-mode>   ::= <binary-digit>
<binary-digit>          ::= 0 | 1
APPENDIX B
Backpropagation

Neural networks are constructed from two
primitive elements: processing units and (directed)
connections between units. The processing units are
individually quite simple, but they are richly
interconnected. Each connection typically has a
real-valued weight associated with it, and this weight
indicates the effect the value of the unit at the source
of the connection has on the unit at its destination.
The output of a unit is some function of the weighted
sum of its inputs:

$$o_j = f\Big(\sum_i w_{ij}\, o_i - \theta_j\Big) \qquad (1)$$

where $o_j$ is the output of unit j, $w_{ij}$ is the
weight from unit i to unit j, and $\theta_j$ is the
"threshold" or bias weight for unit j. The quantity
$\sum_i w_{ij} o_i - \theta_j$ is usually referred to as the net
input to unit j, symbolized $net_j$. The form of Eq. (1)
that is usually employed with back-propagation is the
sigmoid function:

$$f(x) = \frac{1}{1 + e^{-x}} \qquad (2)$$
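As a concrete reading of Eqs. (1) and (2), the following Python sketch computes the output of a single unit. The function names are illustrative and not taken from the implementation.

```python
import math


def sigmoid(x):
    """Eq. (2): the logistic activation function."""
    return 1.0 / (1.0 + math.exp(-x))


def unit_output(weights, inputs, theta):
    """Eq. (1): output of unit j.

    weights: the weights w_ij on the connections into unit j.
    inputs:  the outputs o_i of the source units, in the same order.
    theta:   the threshold (bias weight) of unit j.
    """
    net = sum(w * o for w, o in zip(weights, inputs)) - theta
    return sigmoid(net)
```

A unit whose net input is zero outputs exactly 0.5; large positive net inputs push the output towards 1 and large negative ones towards 0.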
In most backpropagation networks, the units are
arranged in layers and the networks are constrained to
be acyclic. It can be shown that such "multi-layer
feed-forward" networks can realize any mapping from a
multi-dimensional continuous input space to a
multi-dimensional continuous output space with
arbitrarily high accuracy (Hecht-Nielsen, 1987;
Lippmann, 1987; Lapedes & Farber, 1988).

The rule used to modify the weights is:

$$\Delta w_{ij} = \eta\, o_i\, \delta_j \qquad (3)$$

This is the standard backpropagation learning
rule. Here $w_{ij}$ is the weight from unit i to unit j,
$o_i$ is the output of unit i, $\eta$ is a constant that
determines the learning rate, and $\delta_j$ is the error
term for unit j. $\delta_j$ is defined differently for
units in the output area and for units in "hidden"
areas. For output units,

$$\delta_j = o_j'\,(t_j - o_j)$$

where $o_j'$ is the derivative of $o_j$ with respect to
its net input (for the activation function of Eq. (2),
this quantity is $o_j(1 - o_j)$) and $t_j$ is the target
value (the "desired output") for unit j. For hidden
units, the target value is not known and the error term
is computed from the error terms of the next "higher"
layer:

$$\delta_j = o_j' \sum_k w_{jk}\, \delta_k$$
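The error terms and the weight update of the standard rule can be written out directly. The Python sketch below assumes the sigmoid activation of Eq. (2), so that the derivative of a unit's output is o(1 - o); the function names are illustrative.

```python
def output_delta(o_j, t_j):
    """Error term for an output unit: o_j' * (t_j - o_j)."""
    return o_j * (1.0 - o_j) * (t_j - o_j)


def hidden_delta(o_j, weights_out, deltas_next):
    """Error term for a hidden unit.

    weights_out: the weights w_jk from unit j to each unit k in the
                 next higher layer.
    deltas_next: the error terms delta_k of those units, same order.
    """
    back_propagated = sum(w * d for w, d in zip(weights_out, deltas_next))
    return o_j * (1.0 - o_j) * back_propagated


def weight_change(eta, o_i, delta_j):
    """Standard rule: the change applied to the weight from i to j."""
    return eta * o_i * delta_j
```

For instance, an output unit at 0.5 with target 1.0 has an error term of 0.25 × 0.5 = 0.125, and with a learning rate of 0.5 and a source output of 0.5 the corresponding weight grows by 0.03125.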

We have incorporated two extensions to most
uses of backpropagation in our current implementation.
First, we use a recently discovered improvement of Eq.
(3) (Samad, 1988):

[equation not legible in the source]

This equation uses the anticipated value of the
source unit of a weight instead of the current computed
value. In some cases, orders of magnitude faster
learning is achieved.

Second, we allow the value of $\eta$ to decrease as
learning proceeds. That is, $\eta$ is now a variable,
and the learning rule actually used is:

$$\Delta w_{ij} = \eta_t\, o_i\, \delta_j$$

where $\eta_t$ is the value of $\eta$ at the tth
iteration through the training set. At the end of each
iteration, $\eta$ is changed according to the following
formula:

$$\eta_{t+1} = \eta_{slope}\, \eta_t$$

where $\eta_{slope}$ is a parameter that determines the
rate of decay of $\eta$. It has been experimentally
observed that using a high value of $\eta$ initially and
then gradually decreasing it results in significantly
faster learning than using a constant $\eta$. Both
$\eta_{slope}$ and the initial value of $\eta$ ($\eta_0$) are
given by the projection specification in the blueprint.
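A decaying learning rate of this kind can be sketched in Python. The sketch assumes a multiplicative decay, in which each iteration's rate is the previous rate times a slope parameter between 0 and 1; this is one reading of the decay formula and should be treated as an assumption. The values in the usage note are illustrative.

```python
def eta_schedule(eta_0, eta_slope, n_iterations):
    """Return the learning rate used at each pass through the training set.

    eta_0:     initial learning rate (from the projection specification).
    eta_slope: assumed multiplicative decay factor applied after each pass.
    """
    etas = []
    eta = eta_0
    for _ in range(n_iterations):
        etas.append(eta)
        eta *= eta_slope        # decay applied at the end of the iteration
    return etas
```

Starting from an initial rate of 12.8 with a slope of 0.5, the first four passes would use rates of 12.8, 6.4, 3.2 and 1.6.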
It is claimed:

1. A method for synthesizing designs for
neural networks which involves the use of a selected
learning algorithm and a particular subject to be
learned, comprising the steps of:

A. devising a bit string parametric
representation of a neural network architecture having
relevant parameters,

B. generating a first generation of network
blueprints based on said representation which jointly
include a range of values for each of said parameters,

C. generating respective neural network
architectures based on the current generation of said
blueprints,

D. training each of said network architectures
presently defined in step C via said selected learning
algorithm and said subject matter,

E. testing each of said network architectures
presently defined in step C with test patterns
corresponding to said subject matter for testing the
receptiveness of each of said network architectures
presently defined in step C to the effect of said
training,

F. performing an evaluation for each of said
network models presently defined in step C after said
testing thereof relative to performance and cost factors
of interest and assigning a score thereto representing
the results of said evaluation,

G. selecting candidates from said network
architectures presently identified in step C in
accordance with some rationale and applying at least one
operator thereto to produce a new generation of network
blueprints which shall be identified as the current
generation of network blueprints based on said
representation, and

H. returning to step C and continuing the
process.

2. A method according to claim 1 wherein said 
operator is a genetic operator. 



3. A method for synthesizing designs for
neural networks each of which comprise,

a plurality of computational units, a plurality
of hierarchically arranged layer areas including input
and output layer areas and zero or more hidden layer
areas therebetween, each of said layer areas being
defined by a number of said units, connecting means
connecting source groups of said units in said layer
areas other than said output layer area with object
groups of said units in said layer areas other than said
input layer area, said connecting means being grouped in
sets deemed projections with each of said projections
extending from one of said layer areas to another of
said layer areas,

said method comprising the steps of:

A. providing a substring format for specifying
each of said layer areas with said format having one
first type part deemed a layer area parameter specifier
and at least one second type part for each of said
projections deemed a projection specifier,

said first type part comprising a layer area
identifying address section, a total size section
denoting the corresponding number of said units thereof,
and a dimension section denoting the configuration
formed by said units,

each said second type part being dedicated to
one of said projections deemed a subject projection,
said second part type comprising a target address
section for identifying one of said layer areas deemed a
target layer area to which said subject projection is
directed, a mode of address section for said subject
projection, a dimension section for denoting the
configuration of an object field for said subject
projection in said target layer area, a connection
density section for denoting the connectivity of said
subject projection to said object field, and at least
one learning rule parameter section,

B. devising a bit string parametric
representation of a neural network architecture based on
said substring format and having relevant parameters,

C. generating a first generation of network
blueprints based on said representation which jointly
include a range of values for each of said parameters,

D. generating respective neural network
architectures based on the current generation of said
blueprints,

E. training each of said network architectures
presently defined in step D via said selected learning
algorithm and said subject matter,

F. testing each of said network architectures
presently defined in step D with test patterns
corresponding to said subject matter for testing the
receptiveness of each of said network models presently
defined in step D to the effect of said training,

G. performing an evaluation for each of said
network architectures presently defined in step D after
said testing thereof relative to performance and cost
factors of interest and assigning a score thereto
representing the results of said evaluation,

H. selecting candidates from said network
architectures presently identified in step D in
accordance with some rationale and applying at least one
genetic operator thereto to produce a new generation of
network blueprints which shall be identified as the
current generation of network blueprints based on said
representation, and

I. returning to step D and continuing the
process.